I am looking forward to the course for two reasons:
Another initial impression is that the Moodle page is rather verbose and chaotic, which hampers finding the necessary information and the assigned tasks. I hope that I will get used to this and will not miss any compulsory assignment.
I found the course in Sisu because I still needed one credit for my Transferable Skills section.
I've found the book interesting and well written. Naturally, due to my previous experience in R, I was only skim-reading it. Nevertheless I found some useful functionalities that I haven't used before, e.g. multi-condition filtering:

filter(year == 2020 | disease == "COVID19")

and dodged bar charts:

gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp, fill = country)) +
  geom_col(position = "dodge") + theme(legend.position = "none")

I also discovered the percent() function in the tidyverse, which is much simpler than sprintf("%2.1f%%"). Finally, with plotly and the HTML output of R Markdown one can create interactive pages (for example, one can hover over the diagram below to highlight countries):

plot = gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, label = country)) +
  geom_point()
ggplotly(plot)
Describe the work you have done this week and summarize your learning.
We got data on questionnaire responses from individuals together with their results on an exam. The questionnaire appears to concern study habits. The origin of the data cannot be clearly identified from the metadata provided.
# the original raw data, kept for reference
df = as_tibble(read.csv('http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt', sep = '\t'))
# the wrangled data produced earlier, with harmonized column names
learning2014 = read_csv('data/learning2014.csv') %>%
  rename(points = Points, attitude = Attitude)
The original raw data consists of 183 records and 60 variables. After summarizing related questions into groups (such as questions referring to in-depth studying into a "deep" group) by taking their mean, and removing observations with 0 exam points, the processed data consists of 166 observations and 7 variables. Information on the data structure can be found below.
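The grouping step can be sketched in base R as follows; note that the question columns `D03` and `D06` are hypothetical stand-ins, as the real questionnaire uses its own question codes:

```r
# Toy sketch of the wrangling described above (hypothetical question columns)
raw <- data.frame(
  D03    = c(4, 2, 5),   # hypothetical "deep learning" question 1
  D06    = c(3, 3, 5),   # hypothetical "deep learning" question 2
  Points = c(25, 0, 31)  # exam points; zero-point rows get dropped
)

# summarize related questions into one group score by taking their mean
raw$deep <- rowMeans(raw[, c("D03", "D06")])

# remove observations with 0 exam points
processed <- raw[raw$Points > 0, ]
processed$deep  # 3.5 and 5: one toy observation was dropped
```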
str(learning2014)
## spec_tbl_df [166 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ Age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. Age = col_double(),
## .. Attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. Points = col_double()
## .. )
Let’s have a look at the distribution of the variables in this data by plotting them pairwise. We suspect that there are differences between the male and female participants, hence we stratify the data into these two groups by color.
colMap = c("F" = "#FF5555", "M" = "#1100EE")
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(color = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor")) ) +
scale_color_manual(values = colMap) +
scale_fill_manual(values = colMap)
We can read several pieces of information from the plot. First, there were almost 1.5 times more females (red) in the study than males (blue), as the gender bar chart shows. Based on the boxplots, the males' attitude towards statistics was higher; however, there was no huge difference between the exam points of males and females. Nevertheless, the positive correlation between attitude and points is observable in both groups.
In a linear regression analysis we search for a linear relationship between predictor (feature) variables and a target outcome. Let's pick the three variables that are the most likely candidates based on the previous plot:
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The results show that the attitude is significantly useful when predicting the exam points as the p-value for the slope of this variable is very low (\(1.93*10^{-8}\)). The F-statistic of the test also indicates that at least one of the selected variables has non-zero slope.
The t-test for each variable has the null-hypothesis that the slope of that variable in the regression equation is 0. If that is true then there is no association between the variable and the target. If the p-value is low then the result is unlikely under the null-hypothesis so usually we reject it i.e. we believe that there is association.
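Concretely, each reported t value is the coefficient estimate divided by its standard error; for attitude this gives

\[
t = \frac{\hat{\beta}}{\mathrm{SE}(\hat{\beta})} = \frac{3.3952}{0.5741} \approx 5.91,
\]

which is compared against a t-distribution with 162 degrees of freedom to obtain the p-value.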
So we fit the model again with attitude only (as that was the only variable showing significance):
my_model <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The summary of this model shows a positive association between the attitude results and exam points. In lay terms, a 1-unit increase in attitude corresponds to roughly a 3.5-point increase in exam results, on top of a baseline (intercept) of about 11.6 points. The multiple R-squared value gives the proportion of variation explained by the variables in the model. In this case it is 0.1906, so a little less than 20% of the variation is explained by the explanatory variable, meaning that there are other substantial factors affecting the exam results that we are omitting here.
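For reference, the multiple R-squared reported above is the share of the target's variation captured by the model:

\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \approx 0.19.
\]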
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
The linear regression model assumes that the observations are produced by a linear function of the explanatory variable plus additive, normally distributed noise with zero mean. The residuals are our estimate of this noise, hence it is good to check how normal they are and whether they are independent of the fitted value (otherwise the noise would depend on the observation). The first two plots refer to this: on the first one we can see that the residuals are evenly distributed over the fitted values, and the Q-Q plot shows that the quantiles of the residual distribution are very close to the quantiles of a normal distribution. The last plot, "Residuals vs. Leverage", indicates which observations may have the greatest effect on the fitted parameters (linear regression is known to be sensitive to outliers), hence it could be useful to check whether the model improves a lot after removing those observations.
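The assumptions checked by these diagnostic plots correspond to the model

\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \text{ (i.i.d.)},
\]

so the residuals \(y_i - \hat{y}_i\) should look like independent draws from a single normal distribution.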
Beyond explaining the data, linear regression models can also be used to predict unobserved results, such as below.
new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)
# Print out the new data
new_data
## attitude
## Mia 3.8
## Mike 4.4
## Riikka 2.2
## Pekka 2.9
# Predict the new students exam points based on attitude
predict(my_model, newdata = new_data)
## Mia Mike Riikka Pekka
## 25.03390 27.14918 19.39317 21.86099
As one can see, the predicted points follow the attitudes: Mike, with the highest attitude (4.4), gets the highest predicted score, while Riikka, with the lowest attitude (2.2), gets the lowest.
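These predictions are simply the fitted regression line evaluated at the new attitude values; Mia's score, for instance, can be reproduced by hand from the coefficients reported above:

```r
# reproduce Mia's prediction from the fitted coefficients (as printed above)
intercept <- 11.6372
slope     <- 3.5255
mia <- intercept + slope * 3.8
round(mia, 2)  # 25.03, matching predict() up to coefficient rounding
```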
We got data on students' performance at school, including grades, the number of previous class failures, and related background information such as family status, alcohol consumption, etc. We had information from two subjects: Math and Portuguese. To create a joint dataset we averaged (and rounded) the grades from these two subjects and included only students present in both (at least as judged by their attributes; the table contains no unique student identifier, which may cause mismatches).
# Double-check the data wrangling result against the course's reference dataset
df_dc = as_tibble(read.csv('https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv', sep = ','))
df = read_csv('data/student_joint.csv')
# full_join(df, df_dc)
# OK: the join has the same size as each table, i.e. the tables agree.
After these operations there are 370 students with 35 variables in the data.
str(df)
## spec_tbl_df [370 x 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr [1:370] "R" "R" "R" "R" ...
## $ famsize : chr [1:370] "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr [1:370] "T" "T" "T" "T" ...
## $ Medu : num [1:370] 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : num [1:370] 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr [1:370] "at_home" "other" "at_home" "services" ...
## $ Fjob : chr [1:370] "other" "other" "other" "health" ...
## $ reason : chr [1:370] "home" "reputation" "reputation" "course" ...
## $ guardian : chr [1:370] "mother" "mother" "mother" "mother" ...
## $ traveltime: num [1:370] 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : num [1:370] 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ famsup : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ activities: chr [1:370] "yes" "no" "yes" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "yes" "yes" "no" "yes" ...
## $ romantic : chr [1:370] "no" "yes" "no" "no" ...
## $ famrel : num [1:370] 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : num [1:370] 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : num [1:370] 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : num [1:370] 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : num [1:370] 1 4 1 1 3 1 2 3 3 1 ...
## $ health : num [1:370] 1 5 2 5 3 5 5 4 3 4 ...
## $ failures : num [1:370] 0 1 0 0 1 0 1 0 0 0 ...
## $ absences : num [1:370] 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : num [1:370] 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : num [1:370] 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : num [1:370] 12 8 12 9 12 12 6 10 16 10 ...
## $ paid : chr [1:370] "yes" "no" "yes" "yes" ...
## $ alc_use : num [1:370] 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi [1:370] FALSE TRUE FALSE FALSE TRUE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. paid = col_character(),
## .. alc_use = col_double(),
## .. high_use = col_logical()
## .. )
I picked 4 variables to examine whether they correlate with high alcohol consumption. Below I explain my a priori expectations.

- studytime: likely the less someone studies, the more time is available for consuming alcohol. It is unlikely that someone would use alcohol to mitigate stress coming from too much learning.
- activities: the more activities someone does, the less time they have for hanging out with bad student groups; I again expect a negative correlation.
- higher (meaning higher education plans): students with clearer future plans are less likely to be heavy alcohol consumers.
- freetime: probably students with too much free time will get bored, hence consume more alcohol to make their life exciting.

p1 = df %>% ggplot() + aes(x = high_use, y = studytime, fill = high_use) + geom_violin()
p2 = df %>% ggplot() + aes(x = activities, fill = high_use) + geom_bar(position = "fill")
p3 = df %>% ggplot() + aes(x = higher, fill = high_use) + geom_bar(position = "fill")
p4 = df %>% ggplot() + aes(x = high_use, y = freetime, fill = high_use) + geom_violin()
grid.arrange(p1,p2,p3,p4)
m = glm(high_use ~ studytime + activities + higher + freetime, data = df, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ studytime + activities + higher + freetime,
## family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6471 -0.8668 -0.6585 1.1902 2.1362
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0841 0.7071 -0.119 0.905327
## studytime -0.5273 0.1577 -3.344 0.000826 ***
## activitiesyes -0.2283 0.2400 -0.951 0.341541
## higheryes -0.7545 0.5398 -1.398 0.162162
## freetime 0.3340 0.1246 2.680 0.007359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 423.64 on 365 degrees of freedom
## AIC: 433.64
##
## Number of Fisher Scoring iterations: 4
The model suggests that the studytime and freetime variables are statistically significant for predicting high vs. low alcohol usage. It is interesting that higher-education plans are not significant; I double-checked the data, and it is severely skewed towards students planning higher education (16 students not planning vs. 354 planning), so there is too little data to estimate the effect of this variable reliably. The categorical variables (activities and higher) were internally converted to "dummy" indicator levels; as there were only 2 levels in both cases, only 1 dummy variable was created for each.
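The dummy coding that R applies internally can be inspected with `model.matrix()`; a minimal sketch:

```r
# a two-level factor is expanded into a single 0/1 indicator column
d <- data.frame(higher = c("yes", "no", "yes"))
mm <- model.matrix(~ higher, data = d)
colnames(mm)       # "(Intercept)" "higheryes"
mm[, "higheryes"]  # 1 0 1: "no" is the reference level
```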
coef(m)
## (Intercept) studytime activitiesyes higheryes freetime
## -0.08410415 -0.52727634 -0.22825310 -0.75452928 0.33399121
exp(coef(m))
## (Intercept) studytime activitiesyes higheryes freetime
## 0.9193355 0.5902103 0.7959228 0.4702319 1.3965309
The sign of each coefficient gives the direction of the association on the log-odds scale: a negative coefficient means that a higher value of the variable (e.g. studytime) lowers the predicted probability of high alcohol usage. In exponentiated form the coefficients are odds ratios for a one-unit change, which indicate each variable's importance to the model: values close to 1 mean little effect, and deviations in either direction show that the variable matters for the prediction.
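For example, the studytime coefficient above translates into an odds ratio like this:

```r
# one extra unit of studytime multiplies the odds of high use by exp(beta)
beta_studytime <- -0.52727634
odds_ratio <- exp(beta_studytime)
round(odds_ratio, 2)  # 0.59: the odds drop by roughly 40% per unit
```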
confint(m)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.47694251 1.3160333
## studytime -0.84575068 -0.2261421
## activitiesyes -0.70031246 0.2420972
## higheryes -1.84750298 0.3028107
## freetime 0.09312641 0.5827670
These are 95% confidence intervals. A common rule of thumb is that an interval not containing 0 indicates a significant result, and indeed this matches the significance table of the model. The results match my hypotheses except for the higher-education variable, which is explained by the skewed data.
# Use only the significant variables
m2 = glm(high_use ~ studytime + freetime, data = df, family = "binomial")
probabilities <- predict(m2, type = "response")
df %<>% mutate(probability = probabilities)
df %<>% mutate(prediction = if_else(probability>0.5,TRUE,FALSE))
# tabulate the target variable versus the predictions
tt = table(high_use = df$high_use, prediction = df$prediction)
tt
## prediction
## high_use FALSE TRUE
## FALSE 250 9
## TRUE 102 9
This table is also called the confusion matrix. The usual metrics calculated from it are precision (TP/(TP+FP)), here 9/(9+9) = 0.5, which is pretty bad, and recall (TP/(TP+FN)), here 9/(9+102) ≈ 0.08, which is almost 0. Accuracy (correct predictions divided by total predictions) was a bit better, 0.7, so the training error is 1 - accuracy, i.e. 0.3. The model thus performs quite badly at detecting the high alcohol users but quite well at identifying the non-high consumers. Simple coin-flip guessing (without any other knowledge) would give an accuracy around 0.5, so the model is still better than a random guess, though note that always predicting FALSE would already achieve 259/370 ≈ 0.7.
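These metrics can be computed directly from the counts in the confusion matrix above:

```r
# counts read off the confusion matrix above
TN <- 250; FP <- 9
FN <- 102; TP <- 9

precision <- TP / (TP + FP)                   # 0.5
recall    <- TP / (TP + FN)                   # ~0.081
accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # 0.7
c(precision = precision, recall = recall, accuracy = accuracy)
```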